Discrete Distribution Clustering in Big Data and a Method for Storm Prediction Leveraging Large
Authors
Abstract
Big data brings new challenges and opportunities to many scientific areas today. Characterized by the 3Vs model (high volume, velocity, and variety), big data is valuable in many knowledge discovery applications, but it requires new methodologies and technologies to manage and exploit the data. In this dissertation, a fundamental methodology and an emerging application of big data are presented. First, the parallel discrete distribution (PD2) clustering algorithm is designed and implemented. Discrete distributions are widely adopted data signatures in information retrieval and machine learning, and discrete distribution (D2) clustering is a fundamental methodology for grouping them. However, the high computational complexity of D2-clustering limits its impact on massive learning problems. PD2-clustering, with substantially improved scalability, facilitates unsupervised learning in many big data applications. Extensive analysis and experiments are presented to demonstrate the effectiveness and advantages of PD2-clustering. Second, satellite image analysis for storm forecasting is explored as an application of big data in meteorology. A large volume of historical satellite images and storm report archives is mined to predict storms. The proposed algorithm extracts visual storm signatures from satellite image sequences in a way similar to how meteorologists interpret them, and incorporates past meteorological records to model and classify the signatures. This big-data-driven approach aims to overcome the intrinsic numerical instability of conventional weather forecasting based on physical numerical models, and can serve as a new component in a weather forecasting system. Experimental results in both studies show the benefits of leveraging big data in multiple areas.
Similar resources
BIG BANG – BIG CRUNCH ALGORITHM FOR LEAST-COST DESIGN OF WATER DISTRIBUTION SYSTEMS
The Big Bang–Big Crunch (BB–BC) method is a relatively new meta-heuristic algorithm inspired by one of the theories of the evolution of the universe. In the BB–BC optimization algorithm, random points are first produced in the Big Bang phase; these points are then shrunk to a single representative point via a center-of-mass or minimal-cost approach in the Big Crunch phase. In this paper, the...
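The two phases described in this excerpt can be illustrated with a minimal Python sketch. It is not the implementation from the paper, and it optimizes a toy quadratic cost rather than a water distribution design. The function name `big_bang_big_crunch`, its parameters, the inverse-cost weighting of the center of mass, and the 1/k shrinking of the search spread are all assumptions taken from a common formulation of BB–BC; the excerpt itself does not specify them.

```python
import random


def sphere(x):
    """Toy cost function (an assumption for illustration): sum of squares."""
    return sum(v * v for v in x)


def big_bang_big_crunch(cost, lower, upper, n_points=30, n_iters=50, seed=0):
    """Hypothetical minimal BB-BC minimizer for a non-negative cost function."""
    rng = random.Random(seed)
    dim = len(lower)

    # Big Bang: scatter random candidate points over the search box.
    points = [[rng.uniform(lower[d], upper[d]) for d in range(dim)]
              for _ in range(n_points)]
    best = min(points, key=cost)

    for k in range(1, n_iters + 1):
        # Big Crunch: contract all points to one representative point,
        # here a center of mass weighted by inverse cost (assumed scheme).
        weights = [1.0 / (cost(p) + 1e-12) for p in points]
        total = sum(weights)
        center = [sum(w * p[d] for w, p in zip(weights, points)) / total
                  for d in range(dim)]
        if cost(center) < cost(best):
            best = center

        # Next Big Bang: resample around the center with a spread that
        # shrinks as 1/k (assumed schedule), clamped to the search box.
        points = [[min(upper[d], max(lower[d],
                       center[d] + rng.gauss(0.0, 1.0)
                       * (upper[d] - lower[d]) / k))
                   for d in range(dim)]
                  for _ in range(n_points)]
        candidate = min(points, key=cost)
        if cost(candidate) < cost(best):
            best = candidate

    return best


if __name__ == "__main__":
    solution = big_bang_big_crunch(sphere, [-5.0, -5.0], [5.0, 5.0])
    print(solution, sphere(solution))
```

The "minimal cost" alternative mentioned in the excerpt would replace the weighted average with the single lowest-cost point as the representative.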
Robust DEA under discrete uncertain data: a case study of Iranian electricity distribution companies
Crisp input and output data are fundamentally indispensable in traditional data envelopment analysis (DEA). However, real-world problems often involve imprecise or ambiguous data. In this paper, we propose a novel robust data envelopment model (RDEA) to investigate the efficiencies of decision-making units (DMUs) when there are discrete uncertain input and output data. The method is based ...
Design and Test of the Real-time Text mining dashboard for Twitter
One of today's major research trends in the field of information systems is the discovery of implicit knowledge hidden in datasets that are currently being produced at high speed, in large volumes, and in a wide variety of formats. Data with such features is called big data. Extracting, processing, and visualizing such huge amounts of data has today become one of the concerns of data science scholar...
Using Combined Descriptive and Predictive Methods of Data Mining for Coronary Artery Disease Prediction: a Case Study Approach
Heart disease is one of the major causes of morbidity in the world. Currently, a large proportion of healthcare data is not processed properly and therefore is not used effectively for decision-making purposes. The risk of heart disease may be predicted by investigating heart disease risk factors with data mining techniques. This paper presents a model developed using combined descri...
Prediction of Electrofacies Based on Flow Units Using NMR Data and SVM Method: a Case Study in Cheshmeh Khush Field, Southern Iran
The classification of well-log responses into separate flow units for generating local permeability models is often used to predict the spatial distribution of permeability in heterogeneous reservoirs. The present research can be divided into two parts; first, the nuclear magnetic resonance (NMR) log parameters are employed for developing a relationship between relaxation time and reservoir poro...